Syntactic methods for topic-independent authorship attribution

Authors

  • Johanna Björklund
  • Niklas Zechner
Abstract

The efficacy of syntactic features for topic-independent authorship attribution is evaluated, taking a feature set of frequencies of words and punctuation marks as the baseline. The syntactic features are ‘deep’ in the sense that they are derived by parsing the subject texts, in contrast to ‘shallow’ syntactic features for which a part-of-speech analysis is enough. The experiments are conducted on a corpus of novels written around the year 1900 by 20 different authors, and cover two tasks. In the first task, text samples are taken from books by one author, and the goal is to pair samples from the same book. In the second task, text samples are taken from several authors, but only one sample from each book, and the goal is to pair samples from the same author. In the first task, the baseline feature set outperformed the syntax-based feature set, but in the second task, the outcome was the opposite. This suggests that, compared to lexical features such as vocabulary and punctuation, syntactic features are more robust to changes in topic.
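
As a rough illustration of the baseline described above (frequencies of words and punctuation marks, used to pair text samples), the following minimal Python sketch may help. It is not the authors' implementation: the tokenizer, the vocabulary size, the use of cosine similarity, and the nearest-neighbour pairing rule are all assumptions made for this example, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: words are runs of letters/apostrophes; a small set of
# punctuation marks is kept as separate tokens.
TOKEN_RE = re.compile(r"[A-Za-z']+|[.,;:!?]")


def tokenize(text):
    """Split a text into lowercase word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(samples, size=50):
    """Pick the `size` most frequent tokens across all samples."""
    counts = Counter()
    for text in samples:
        counts.update(tokenize(text))
    return [token for token, _ in counts.most_common(size)]


def feature_vector(text, vocabulary):
    """Relative frequency of each vocabulary token in one sample."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[token] / total for token in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_samples(samples, size=50):
    """Map each sample index to the index of its most similar other sample."""
    vocabulary = build_vocabulary(samples, size)
    vectors = [feature_vector(text, vocabulary) for text in samples]
    pairs = {}
    for i, u in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        pairs[i] = max(others, key=lambda j: cosine(u, vectors[j]))
    return pairs
```

Calling `pair_samples` on a list of text samples returns, for each sample, the index of the sample it would be paired with. A syntax-based variant would replace `feature_vector` with frequencies of features derived from parse trees, which requires a parser and is outside the scope of this sketch.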

Similar resources

Syntactic Stylometry: Using Sentence Structure for Authorship Attribution

Most approaches to statistical stylometry have concentrated on lexical features, such as relative word frequencies or type-token ratios. Syntactic features have been largely ignored. This work attempts to fill that void by introducing a technique for authorship attribution based on dependency grammar. Syntactic features are extracted from texts using a common dependency parser, and those featur...

Lost in Translation: Authorship Attribution using Frame Semantics

We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attri...

Domain Independent Authorship Attribution without Domain Adaptation

Automatic authorship attribution, by its nature, is much more advantageous if it is domain (i.e., topic and/or genre) independent. That is, many real world problems that require authorship attribution may not have in-domain training data readily available. However, most previous work based on machine learning techniques focused only on in-domain text for authorship attribution. In this paper, w...

Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs

We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs and apply them to determine the authors of docume...

Shallow Text Analysis and Machine Learning for Authorship Attribution

Current advances in shallow parsing and machine learning allow us to use results from these fields in a methodology for Authorship Attribution. We report on experiments with a corpus that consists of newspaper articles about national current affairs by different journalists from the Belgian newspaper De Standaard. Because the documents are in a similar genre, register, and range of topics, toke...

Journal:
  • Natural Language Engineering

Volume 23  Issue 

Pages  -

Publication date: 2017